Abstract

We consider stochastic sequential learning problems where the
learner can observe the average reward of several actions. Such a
setting is interesting in many applications involving monitoring and
surveillance, where the set of actions to observe represents some
(geographical) area. The importance of this setting is that in these
applications it is actually cheaper to observe the average reward of
a group of actions than the reward of a single action. We show that
when the reward is smooth over a given graph representing the
neighboring actions, we can maximize the cumulative reward of
learning while minimizing the sensing cost. In this paper we propose
CheapUCB, an algorithm that matches the regret guarantees of the
known algorithms for this setting and at the same time guarantees a
linear cost gain over them. As a by-product of our analysis, we
establish an $\Omega(\sqrt{dT})$ lower bound on the cumulative regret
of spectral bandits for a class of graphs with effective dimension $d$.

1. Introduction 

In many online learning and bandit problems, the learner is asked to
select a single action for which it obtains (possibly contextual)
feedback. However, in many scenarios such as surveillance, monitoring
and exploration of a large area or network, it is often cheaper to
obtain an average reward for a group of actions rather than a reward
for a single one. In this paper, we therefore study group actions and
formalize this setting as cheap bandits on graph-structured data.
Nodes and edges in our graph model the geometric structure of the
data and we associate signals (rewards) with each node. We are
interested in problems where the actions are a collection of nodes.
Our objective is to locate nodes with the largest rewards.

The cost aspect of our problem arises in sensor networks (SNETs) for
target localization and identification. In SNETs, sensors have
limited sensing range (Ermis & Saligrama, 2010; 2005) and can
reliably sense/identify targets only in their vicinity. To conserve
battery power, sleep/awake scheduling is used (Fuemmeler &
Veeravalli, 2008; Aeron et al., 2008), wherein a group of sensors is
woken up sequentially based on probable locations of the target. The
group of sensors minimizes transmit energy through coherent
beamforming of the sensed signal, which is then received as an
average reward/signal at the receiver. While coherent beamforming is
cheaper, it nevertheless increases target ambiguity since the sensed
field degrades with distance from the target. A similar scenario
arises in aerial reconnaissance as well: larger areas can be
surveilled at higher altitudes more quickly (cheaper) but at the cost
of more target ambiguity.

Moreover, sensing average rewards through group actions, in the
initial phases, is also meaningful. Rewards in many applications are
typically smooth band-limited graph signals (Narang et al., 2013)
with the sensing field decaying smoothly with distance from the
target. In addition to SNETs (Zhu & Rabbat, 2012), smooth graph
signals also arise in social networks (Girvan & Newman, 2002) and
recommender systems. Signals on graphs is an emerging area in signal
processing (SP), but the emphasis there is on reconstruction through
sampling and interpolation from a small subset of nodes (Shuman et
al., 2013). In contrast, our goal is to locate the maxima of graph
signals rather than to reconstruct them. Nevertheless, SP does
provide us with the key insight that whenever the graph signal is
smooth, we can obtain information about a location by sampling its
neighborhood.

Our approach is to sequentially discover the nodes with optimal
reward. We model this problem as an instance of linear bandits (Auer,
2002; Dani et al., 2008; Li et al., 2010) that links the rewards of
nodes through an unknown parameter. A bandit setting for smooth
signals was recently studied by Valko et al. (2014), however
neglecting the signal cost. While bandit algorithms typically aim to
minimize the regret, we aim to minimize both the regret and the
signal cost. Nevertheless, we do not want to trade off the regret for
cost. In particular, we are not compromising regret for cost, nor do
we seek a Pareto frontier of the two objectives. We seek algorithms
that minimize the cost of sensing and at the same time attain the
state-of-the-art regret guarantees.

Notice that our setting directly generalizes the traditional setting
with a single action per time step, as the arms themselves are graph
signals. We define the cost of each arm in terms of its graph Fourier
transform. The cost is quadratic in nature and assigns higher cost to
arms that collect average information from a smaller set of
neighbors. Our goal is to collect higher reward from the nodes while
keeping the total cost small. However, there is a tradeoff between
choosing low-cost signals and collecting higher reward: the arms
collecting reward from individual nodes cost more, but give more
specific information about a node's reward and hence provide better
estimates. On the other hand, arms that collect the average reward
from a subset of neighbors cost less, but only give a crude estimate
of the reward function. In this paper, we develop an algorithm that
maximizes the reward collection while keeping the cost low.

2. Related Work 

There are several other bandit and online learning settings that
consider costs (Tran-Thanh et al., 2012; Badanidiyuru et al., 2013;
Ding et al., 2013; Badanidiyuru et al., 2014; Zolghadr et al., 2013;
Cesa-Bianchi et al., 2013a). The first set is referred to as budgeted
bandits (Tran-Thanh et al., 2012) or bandits with knapsacks
(Badanidiyuru et al., 2013), where each single arm is associated with
a cost. This cost can be known or unknown (Ding et al., 2013) and can
depend on a given context (Badanidiyuru et al., 2014). The goal there
is in general to minimize the regret as a function of budget instead
of time, or to minimize regret under budget constraints, where there
is no advantage in not spending all the budget. Our goal is different
as we care both about minimizing the budget and minimizing the regret
as a function of time. Another cost setting considers a cost for
observing features from which the learner can build its prediction
(Zolghadr et al., 2013). This is different from our consideration of
cost, which is inversely proportional to the sensing area. Finally,
in the adversarial setting, Cesa-Bianchi et al. (2013a) consider a
cost for switching actions.

The graph bandit setting most related to ours is that of Valko et al.
(2014), on which we build this paper. Another graph bandit setting
considers side information, where the learner obtains, besides the
reward of the node it chooses, also the rewards of the neighbors
(Mannor & Shamir, 2011; Alon et al., 2013; Caron et al., 2012; Kocak
et al., 2014). Finally, a different graph bandit setup is the gang of
(multiple) bandits considered by Cesa-Bianchi et al. (2013b) and
online clustering of bandits by Gentile et al. (2014).

Our main contribution is the incorporation of sensing cost into
learning in linear bandit problems while simultaneously minimizing
two performance metrics: the cumulative regret and the cumulative
sensing cost. We develop CheapUCB, an algorithm that guarantees a
regret bound of the order $d\sqrt{T}$, where $d$ is the effective
dimension and $T$ is the number of rounds. This regret bound is of
the same order as that of SpectralUCB (Valko et al., 2014), which
does not take cost into consideration. However, we show that our
algorithm provides a cost saving that is linear in $T$ compared to
the cost of SpectralUCB. The effective dimension $d$ that appears in
the bound is typically smaller in real-world graphs than the number
of nodes $N$. This is in contrast with linear bandits, which in this
graph setting can achieve a regret of $N\sqrt{T}$ or $\sqrt{NT}$.
However, our ideas of cheap sensing are directly applicable to the
linear bandit setting as well. As a by-product of our analysis, we
establish an $\Omega(\sqrt{dT})$ lower bound on the cumulative regret
for a class of graphs with effective dimension $d$.

3. Problem Setup 

Let $G = (V, E)$ denote an undirected graph with number of nodes
$|V| = N$. We assume that the degree of all the nodes is bounded by
$n$. Let $s : V \to \mathbb{R}$ denote a signal on $G$, and
$\mathcal{S}$ the set of all possible signals on $G$. Let
$\mathcal{L} = D - A$ denote the unnormalized Laplacian of the graph
$G$, where $A = \{a_{ij}\}$ is the adjacency matrix and $D$ is the
diagonal matrix with $D_{ii} = \sum_j a_{ij}$. We emphasize that our
main results extend to weighted graphs if we replace the matrix $A$
with the edge weight matrix $W$. We work with matrix $A$ for
simplicity of exposition. We denote the eigenvalues of $\mathcal{L}$
as $0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_N$, and the
corresponding eigenvectors as $q_1, q_2, \ldots, q_N$. Equivalently,
we write $\mathcal{L} = Q \Lambda_{\mathcal{L}} Q'$, where
$\Lambda_{\mathcal{L}} = \mathrm{diag}(\lambda_1, \lambda_2, \ldots,
\lambda_N)$ and $Q$ is the $N \times N$ orthonormal matrix with
eigenvectors in columns. We denote the transpose of $a$ as $a'$, and
all vectors are by default column vectors. For a given matrix $V$, we
denote the $V$-norm of a vector $a$ as $\|a\|_V = \sqrt{a'Va}$.
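
As a minimal sketch of the notation above (an illustration, not code from the paper), the following NumPy snippet builds the unnormalized Laplacian of a small path graph and checks the decomposition $\mathcal{L} = Q \Lambda_{\mathcal{L}} Q'$. The toy graph and the helper names `laplacian` and `v_norm` are assumptions introduced only for this example.

```python
import numpy as np

def laplacian(A):
    """Unnormalized Laplacian L = D - A for a symmetric (weighted) adjacency A."""
    A = np.asarray(A, dtype=float)
    return np.diag(A.sum(axis=1)) - A

def v_norm(a, V):
    """V-norm ||a||_V = sqrt(a' V a) for a positive semi-definite matrix V."""
    return float(np.sqrt(max(a @ V @ a, 0.0)))

# Toy example (an assumption for illustration): path graph on 4 nodes.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = laplacian(A)

# eigh returns eigenvalues in ascending order and orthonormal eigenvectors
# as the columns of Q, matching 0 = lambda_1 <= ... <= lambda_N.
eigvals, Q = np.linalg.eigh(L)
```

The constant signal has zero $\mathcal{L}$-norm, which is one quick way to sanity-check a Laplacian built this way.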
3.1. Reward function

We define a reward function on a graph $G$ as a linear combination of
the eigenvectors. For a given parameter vector $\alpha \in
\mathbb{R}^N$, let $f_\alpha : V \to \mathbb{R}$ denote the reward
function on the nodes defined as

$$f_\alpha = Q\alpha.$$

The parameter $\alpha$ can be suitably penalized to control the
smoothness of the reward function. For instance, if we choose
$\alpha$ such that large coefficients correspond to the eigenvectors
associated with small eigenvalues, then $f_\alpha$ is a smooth
function of $G$ (Belkin et al., 2008). We denote the unknown
parameter that defines the true reward function as $\alpha^*$. We
denote the reward of node $i$ as $f_{\alpha^*}(i)$.

In our setting, the arms are nodes and the subsets of their
neighbors. When an arm is selected, we observe only the average of
the rewards of the nodes selected by that arm. To make this notion
formal, we associate arms with probe signals on graphs.

3.2. Probes

Let $\mathcal{S} \subset \{ s \in [0,1]^N : \sum_{i=1}^{N} s_i = 1 \}$
denote the set of probes. We use the words probe and action
interchangeably. A probe is a signal with its width corresponding to
the support of the signal $s$. For instance, it could correspond to
the region-of-coverage or region-of-interest probed by a radar pulse.
Thus each $s \in \mathcal{S}$ is of the form $s_i \in \{0,
1/\mathrm{supp}(s)\}$ for all $i = 1, 2, \ldots, N$, where
$\mathrm{supp}(s)$ denotes the number of positive elements in $s$.
The inner product of $f_{\alpha^*}$ and a probe $s$ is the average
reward of $\mathrm{supp}(s)$ nodes.

We parametrize a probe in terms of its width $w \in [N]$ and let the
set of probes of width $w$ be $\mathcal{S}_w = \{s \in \mathcal{S} :
\mathrm{supp}(s) = w\}$. For a given $w > 0$, our focus in this paper
is on probes with uniformly weighted components, which are limited to
neighborhoods of each node on the graph. We denote the collection of
these probes as $\bar{\mathcal{S}}_w \subset \mathcal{S}_w$, which
has $N$ elements. We denote the element in $\bar{\mathcal{S}}_w$
associated with node $i$ as $s_i^w$. Suppose node $i$ has neighbors
at $\{j_1, j_2, \ldots, j_{w-1}\}$; then $s_i^w$ is described as:

$$s_{ik}^w = \begin{cases} 1/w & \text{if } k = i \\ 1/w & \text{if } k = j_l,\ l = 1, 2, \ldots, w - 1 \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$

If node $i$ has more than $w$ neighbors, there can be multiple ways
to define $s_i^w$ depending on the choice of its neighbors. When $w$
is less than the degree of node $i$, in defining $s_i^w$ we only
consider neighbors with larger edge weights. If all the weights are
the same, then we select $w$ neighbors arbitrarily. Note that
$|\bar{\mathcal{S}}_w| = N$ for all $w$. In the following we write
'probing with $s$' to mean that $s$ is used to get information from
nodes of graph $G$.

We define the arms as the set

$$\mathcal{S}_D := \{\bar{\mathcal{S}}_w : w = 1, 2, \ldots, N\}.$$

Compared to multi-arm and linear bandits, the number of arms $K$ is
$O(N^2)$ and the contexts have dimension $N$.
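
The probe construction in (1) can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names `make_probe` and `probe_reward` are hypothetical, ties between equal edge weights are broken by node index (one arbitrary choice), and widths larger than the degree plus one are rejected because the text does not say how to handle them.

```python
def make_probe(weights, i, w):
    """Return the width-w probe s_i^w as a list of length N.

    weights: N x N symmetric edge-weight matrix (list of lists), 0 = no edge.
    The probe puts mass 1/w on node i and on its w - 1 heaviest neighbors.
    """
    n = len(weights)
    neighbors = [k for k in range(n) if k != i and weights[i][k] > 0]
    if w < 1 or w - 1 > len(neighbors):
        raise ValueError("width must satisfy 1 <= w <= degree(i) + 1")
    # Prefer neighbors with larger edge weights; ties are broken by index,
    # which is one arbitrary choice among those the text allows.
    neighbors.sort(key=lambda k: (-weights[i][k], k))
    probe = [0.0] * n
    for k in [i] + neighbors[: w - 1]:
        probe[k] = 1.0 / w
    return probe

def probe_reward(probe, f):
    """Inner product of a probe with node rewards f: an average over the support."""
    return sum(p * r for p, r in zip(probe, f))

# Toy star graph (assumed example): node 0 connected to nodes 1, 2, 3.
W = [[0, 3, 1, 2],
     [3, 0, 0, 0],
     [1, 0, 0, 0],
     [2, 0, 0, 0]]
f = [1.0, 0.8, 0.2, 0.5]
s = make_probe(W, i=0, w=3)      # support: node 0 and neighbors 1 and 3
avg = probe_reward(s, f)         # average of f over nodes 0, 1 and 3
```

With $w = 1$ the probe reduces to a single-node action, so node actions are a special case of this construction.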


3.3. Cost of probes 

The cost of the arms is defined using the spectral properties of
their associated graph probes. Let $\tilde{s}$ denote the graph
Fourier transform (GFT) of probe $s \in \mathcal{S}$. Analogous to
the Fourier transform of a continuous function, the GFT gives
amplitudes associated with graph frequencies. The GFT coefficient of
a probe on frequency $\lambda_i$, $i = 1, 2, \ldots, N$, is obtained
by projecting it on $q_i$, i.e.,

$$\tilde{s} = Q's,$$

where $\tilde{s}_i$, $i = 1, 2, \ldots, N$, is the GFT coefficient
associated with frequency $\lambda_i$. Let $C : \mathcal{S} \to
\mathbb{R}_+$ denote the cost function. Then the cost of the probe
$s$ is described by

$$C(s) = \sum_{i \sim j} (s_i - s_j)^2,$$

where the summation is over all the unordered node pairs $\{i, j\}$
for which node $i$ is adjacent to node $j$. We motivate this cost
function from the SNET perspective, where probes with large width are
relatively cheap. We first observe that the cost of a constant probe
is zero. For a probe $s_i^w \in \mathcal{S}_w$ of width $w$ it
follows that$^1$


$$C(s_i^w) = \frac{w-1}{w^2}\,[\ldots] + \frac{1}{w^2}. \tag{2}$$


Note that the cost of the $w$-width probe associated with node $i$
depends only on its width $w$. For $w = 1$, $C(s_i^1) = 1$ for all
$i = 1, 2, \ldots, N$. That is, the cost of probing individual nodes
of the graph is the same. Also note that $C(s_i^w)$ is decreasing in
$w$, implying that probing a node is more costly than probing a
subset of its neighbors.

Alternatively, we can associate probe costs with eigenvalues of the
graph Laplacian. Constant probes correspond to the zero eigenvalue of
the graph Laplacian. More generally, we see that

$$C(s) = \sum_{i \sim j} (s_i - s_j)^2 = s'\mathcal{L}s = \sum_{i=1}^{N} \lambda_i \tilde{s}_i^2 = \tilde{s}'\Lambda_{\mathcal{L}}\tilde{s}.$$

It follows that $C(s) = \|s\|_{\mathcal{L}}^2$. The operation of
pulling an arm and observing a reward is equivalent to probing the
graph with a probe. This results in a value that is the inner product
of the probe signal and the graph reward function. We write the
reward in the probe space $\mathcal{S}_D$ as follows. Let $F_G :
\mathcal{S} \to \mathbb{R}$ defined as

$$F_G(s) = s'Q\alpha^* = \tilde{s}'\alpha^*$$

denote the reward obtained from probe $s$. Thus, each arm gives a
reward that is linear, and has quadratic cost, in its GFT
coefficients. In terms of the linear bandit terminology, the GFT
coefficients in $\mathcal{S}_D$ constitute the set of arms.

With the rewards defined in terms of the probes, the optimization of
the reward function is over the action space. Let $s_* = \arg\max_{s
\in \mathcal{S}_D} F_G(s)$ denote the probe that gives the maximum
reward. This is a straightforward linear optimization problem if the
function parameter $\alpha^*$ is known. When $\alpha^*$ is unknown we
can learn the function through a sequence of measurements.

$^1$ We symmetrized the graph by adding self loops to all the nodes
to make their degree (number of neighbors) $N$, and normalized the
cost by $N$.
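
The three equivalent forms of the cost, and the two forms of the reward $F_G$, can be checked numerically. The snippet below is a sketch on an assumed toy graph (a 5-node cycle with unit weights) and a hypothetical parameter `alpha_star`; it uses the raw Laplacian and does not apply the normalization described in the footnote.

```python
import numpy as np

def edge_cost(A, s):
    """C(s) = sum over unordered adjacent pairs {i, j} of a_ij * (s_i - s_j)^2."""
    n = len(s)
    return sum(A[i, j] * (s[i] - s[j]) ** 2
               for i in range(n) for j in range(i + 1, n))

# Toy cycle graph on 5 nodes (an assumed example, unit edge weights).
N = 5
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
L = np.diag(A.sum(axis=1)) - A
lam, Q = np.linalg.eigh(L)

s = np.array([0.5, 0.5, 0.0, 0.0, 0.0])   # width-2 probe on nodes 0 and 1
s_tilde = Q.T @ s                          # GFT coefficients of the probe

cost_edges = edge_cost(A, s)               # sum over edges
cost_quad = s @ L @ s                      # s' L s
cost_spec = np.sum(lam * s_tilde ** 2)     # sum_i lambda_i * s_tilde_i^2

alpha_star = np.array([1.0, -0.5, 0.2, 0.0, 0.0])  # hypothetical parameter
reward_nodes = s @ (Q @ alpha_star)        # s' Q alpha*
reward_gft = s_tilde @ alpha_star          # s_tilde' alpha*
```

All three cost values agree, and so do the two reward values, which is what allows the bandit problem to be posed entirely in terms of GFT coefficients.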


3.4. Learning setting and performance metrics

Our learning setting is the following. The learner uses a policy
$\pi : \{1, \ldots, T\} \to \mathcal{S}_D$ that assigns at step
$t \le T$ probe $\pi(t)$. In each step $t$, the learner incurs a cost
$C(\pi(t))$ and obtains a noisy reward such that

$$r_t = F_G(\pi(t)) + \varepsilon_t,$$

where $\varepsilon_t$ is independent $R$-sub-Gaussian for any $t$.
The cumulative regret of policy $\pi$ is defined as

$$R_T = T F_G(s_*) - \sum_{t=1}^{T} F_G(\pi(t)) \tag{3}$$

and the total cost incurred up to time $T$ is given by

$$C_T = \sum_{t=1}^{T} C(\pi(t)). \tag{4}$$

The goal of the learner is to learn a policy $\pi$ that minimizes the
total cost $C_T$ while keeping the cumulative (pseudo) regret $R_T$
as low as possible.

Node vs. Group actions: The set $\mathcal{S}_D$ allows actions that
can probe a node (node-action) or a subset of nodes (group-action).
Though the group actions have smaller cost, they only provide average
reward information for the selected nodes. In contrast, node actions
provide crisper information of the reward for the selected node, but
at a cost premium. Thus, an algorithm that uses only node actions can
provide a better regret performance compared to one that takes group
actions. But if the algorithm uses only node actions, the cumulative
cost can be high.

In the following, we first state the regret performance of the
SpectralUCB algorithm (Valko et al., 2014) that uses only node
actions. We then develop an algorithm that aims to achieve the same
order of regret using group actions and reducing the total sensing
cost.

4. Node Actions: Spectral Bandits

If we restrict the action set to $\mathcal{S}_D = \{e_i : i = 1, 2,
\ldots, N\}$, where $e_i$ denotes a binary vector with the $i$-th
component set to 1 and all the other components set to 0, then only
node actions are allowed in each step. In this setting, the cost is
the same for all the actions, i.e., $C(e_i) = 1$ for all $i$.

Using these node actions, Valko et al. (2014) developed SpectralUCB,
which aims to minimize the regret under the assumption that the
reward function is smooth. The smoothness condition is characterized
as follows:

$$\exists\, c > 0 \text{ such that } \|\alpha^*\|_\Lambda \le c. \tag{5}$$

Here $\Lambda = \Lambda_{\mathcal{L}} + \lambda I$, and $\lambda > 0$
is used to make $\Lambda_{\mathcal{L}}$ invertible. The bound $c$
characterizes the smoothness of the reward. When $c$ is small, the
rewards on the neighboring nodes are more similar. In particular,
when the reward function is a constant, then $c = 0$. To characterize
the regret performance of SpectralUCB, Valko et al. (2014) introduced
the notion of effective dimension defined as follows:

Definition 1 (Effective dimension) For graph $G$, let us denote
$\lambda = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_N$ the
diagonal elements of $\Lambda$. Given $T$, the effective dimension is
the largest $d$ such that:

$$(d - 1)\lambda_d \le \frac{T}{\log(1 + T/\lambda)}. \tag{6}$$

Theorem 1 (Valko et al., 2014) The cumulative regret of SpectralUCB
is bounded with probability at least $1 - \delta$ as:

$$R_T \le \left(8R\sqrt{d\log(1 + T/\lambda) + 2\log(1/\delta)} + 4c\right)\sqrt{dT\log(1 + T/\lambda)}.$$

Lemma 1 The total cost of SpectralUCB is $C_T = T$.

Note that the effective dimension depends on $T$ and also on how fast
the eigenvalues grow. The regret performance of SpectralUCB is good
when $d$ is small, which occurs when the eigenspectrum exhibits large
gaps. In these situations, SpectralUCB has a regret that scales as
$O(d\sqrt{T})$ for a large range of values of $T$. To see this,
notice that in relation (6), when $\lambda_{d+1}/\lambda_d$ is large,
the value of the effective dimension remains unchanged over a large
range of $T$, implying that the regret bound of $O(d\sqrt{T})$ is
valid for a large range of values of $T$ with the same $d$.

There are many graphs for which the effective dimension is small. For
example, random graphs are good expanders for which eigenvalues grow
fast. Another setting is stochastic block models (Girvan & Newman,
2002), which exhibit a large eigenvalue gap and are popular in the
analysis of social, biological, citation, and information networks.
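
The effective dimension of Definition 1 is straightforward to compute from a spectrum. The function below is a minimal sketch; its name and the example spectrum (loosely modeled on a graph with four well-separated clusters) are assumptions made for illustration.

```python
import math

def effective_dimension(eigvals, T, lam):
    """Largest d with (d - 1) * lambda_d <= T / log(1 + T / lam).

    eigvals: Laplacian eigenvalues; lam > 0 is the regularizer, so the
    diagonal of Lambda is eigvals + lam, sorted in ascending order.
    """
    diag = sorted(v + lam for v in eigvals)
    bound = T / math.log(1.0 + T / lam)
    d = 1  # d = 1 always qualifies because (d - 1) * lambda_1 = 0
    for k, lam_k in enumerate(diag, start=1):
        if (k - 1) * lam_k <= bound:
            d = k
    return d

# Hypothetical spectrum with a large gap after the 4th eigenvalue.
spectrum = [0.0, 3.0, 4.0, 5.0, 29.0, 29.6]
d_small_T = effective_dimension(spectrum, T=100, lam=0.01)
d_mid_T = effective_dimension(spectrum, T=250, lam=0.01)
d_large_T = effective_dimension(spectrum, T=1000, lam=0.01)
```

On this spectrum the value stays at 4 for both $T = 250$ and $T = 1000$, which illustrates how a large eigenvalue gap keeps $d$ unchanged over a wide range of horizons.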


5. Group Actions: Cheap Bandits

Recall (Section 3.3) that group actions are cheaper than node
actions. Furthermore, the cost of group actions is decreasing in
group size. In this section, we develop a learning algorithm that
aims to minimize the total cost without compromising on the regret
using group actions. Specifically, given $T$ and a graph with
effective dimension $d$, our objective is as follows:

$$\min_{\pi} C_T \quad \text{subject to} \quad R_T \lesssim d\sqrt{T}, \tag{7}$$

where the optimization is over policies defined on the action set
$\mathcal{S}_D$ given in subsection 3.2.

5.1. Lower bound

The action set used in the above optimization problem is larger than
the set used in SpectralUCB. This raises the question of whether or
not the regret order of $d\sqrt{T}$ is too loose, particularly when
SpectralUCB can realize this bound using a much smaller set of
probes.

In this section we derive a $\sqrt{dT}$ lower bound on the expected
regret (worst-case) for any algorithm using action space
$\mathcal{S}_D$ on graphs with effective dimension $d$. While this
implies that our target in (7) should be $\sqrt{dT}$, we follow Valko
et al. (2014) and develop a variation of SpectralUCB that obtains the
target regret of $d\sqrt{T}$. We leave it as future work to develop
an algorithm that meets the target regret of $\sqrt{dT}$ while
minimizing the cost.

Let $\mathcal{G}_d$ denote a set of graphs with effective dimension
$d$. For a given policy $\pi$, $\alpha^*$, $T$ and graph $G$, define
the expected cumulative regret as

$$\mathrm{Regret}(T, \pi, \alpha^*, G) = \mathbb{E}\left[\sum_{t=1}^{T} \tilde{s}_*\alpha^* - \tilde{s}_t\alpha^* \,\Big|\, \alpha^*\right],$$

where $\tilde{s}_t = \pi(t)'Q$.

Proposition 1 For any policy $\pi$ and time period $T$, there exists
a graph $G \in \mathcal{G}_d$ and an $\alpha^* \in \mathbb{R}^N$
representing a smooth reward such that

$$\mathrm{Regret}(T, \pi, \alpha^*, G) = \Omega(\sqrt{dT}).$$

The proof follows by construction of a graph with $d$ disjoint
cliques and restricting the rewards to be piecewise constant on the
cliques. The problem then reduces to identifying the clique with the
highest reward. We then reduce the problem to the multi-arm case,
using Theorem 5.1 of Auer et al. (2003), and lower bound the minimax
risk. See the supplementary material for a detailed proof.

5.2. Local smoothness

In this subsection we show that a smooth reward function on a graph
with low effective dimension implies local smoothness of the reward
function around each node. Specifically, we establish that the
average reward around the neighborhood of a node provides good
information about the reward of the node itself. Then, instead of
probing a node, we can use group actions to probe its neighborhood
and get good estimates of the reward at low cost.

From the discussion in Section 4, when $d$ is small and there is a
large gap between $\lambda_d$ and $\lambda_{d+1}$, SpectralUCB enjoys
a small bound on the regret for a large range of values in the
interval $[(d - 1)\lambda_d, d\lambda_{d+1}]$. Intuitively, a large
gap between the eigenvalues implies that there is a good partitioning
of the graph into tight clusters. Furthermore, the smoothness
assumption implies that the rewards of a node and its neighbors
within each cluster are similar.

Let $\mathcal{N}_i$ denote a set of neighbors of node $i$. The
following result provides a relation between the reward of node $i$
and the average reward from $\mathcal{N}_i$ of its neighbors.

Proposition 2 Let $d$ denote the effective dimension and
$\lambda_{d+1}/\lambda_d \ge O(d^2)$. Let $\alpha^*$ satisfy (5). For
any node $i$

$$\left| f_{\alpha^*}(i) - \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} f_{\alpha^*}(j) \right| \le c'd/\lambda_{d+1} \tag{8}$$

for all $\mathcal{N}_i$, where $c'$ is a constant multiple of $c$.

The full proof is given in the supplementary material. It is based on
the $k$-way expansion constant together with bounds on the higher
order Cheeger inequality (Gharan & Trevisan, 2014). Note that (8)
holds for all $i$. However, we only need this to hold for the node
with the optimal reward to establish the regret performance of our
algorithm. We rewrite (8) for the optimal node $i_*$ using group
actions as follows:

$$|F_G(s_*) - F_G(s_*^w)| \le c'd/\lambda_{d+1} \quad \text{for all } w \le |\mathcal{N}_{i_*}|. \tag{9}$$

Though we give the proof of the above result under the technical
assumption $\lambda_{d+1}/\lambda_d \ge O(d^2)$, it holds in cases
where eigenvalues grow fast. For example, for graphs with a strong
connectivity property this inequality is trivially satisfied. We can
show that $|F_G(s_*) - F_G(s_*^w)| \le c/\sqrt{\lambda_2}$ through a
standard application of the Cauchy-Schwarz inequality. For the
Barabási-Albert model we get $\lambda_2 = \Omega(N^\gamma)$ with
$\gamma > 0$, and for cliques we get $\lambda_2 = N$.
General graphs: When $\lambda_{d+1}$ is much larger than $\lambda_d$,
the above proposition gives a tight relationship between the optimal
reward and the average reward from its neighborhood. However, for
general graphs this eigenvalue gap assumption is not valid. Motivated
by (9), we assume that the smooth reward function satisfies the
following weaker version for general graphs. For all
$w \le |\mathcal{N}_{i_*}|$,

$$|F_G(s_*) - F_G(s_*^w)| \le c'\sqrt{T}\,w/\lambda_{d+1}. \tag{10}$$

These inequalities get progressively weaker in $T$ and $w$ and can be
interpreted as follows. For small values of $T$, we have few rounds
for exploration and require stronger assumptions on smoothness. On
the other hand, as $T$ increases we have the opportunity to explore
and consequently the inequalities are more relaxed. This relaxation
of the inequality as a function of the width $w$ characterizes the
fact that close neighborhoods around the optimal node provide better
information about the optimal reward than a wider neighborhood.

5.3. Algorithm: CheapUCB 

Below we present an algorithm similar to LinUCB (Li et al., 2010) and
SpectralUCB (Valko et al., 2014) for regret minimization. The main
difference between our algorithm and the SpectralUCB algorithm is the
enlarged action space, which allows for selection of subsets of nodes
and the associated realization of average rewards. Note that when we
probe a specific node instead of probing a subset of nodes, we get
more precise (though noisy) information about the node, but this
results in higher cost.

As our goal is to minimize the cost while maintaining a low regret,
we handle this requirement by moving sequentially from the least
costly probes to expensive ones as we progress. In particular, we
split the time horizon into $J$ stages, and as we move from stage $j$
to $j + 1$ we use more expensive probes. That is, we use probes with
smaller widths as we progress through the different stages of
learning. The algorithm uses the probes of different widths in each
stage as follows. Stage $j = 1, \ldots, J$ consists of time steps
from $2^{j-1}$ to $2^j - 1$ and uses probes of width $J - j + 1$
only.
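
The doubling stage schedule described above can be made concrete with a short sketch. The function names are hypothetical, and two choices here are assumptions rather than statements from the text: $J$ is taken as the smallest integer with $2^J - 1 \ge T$ so that every step up to $T$ falls in some stage, and a width-$w$ probe is assumed to cost at most $1/w$.

```python
def stage_schedule(T):
    """Return (stage j, first step, last step, probe width) for each stage.

    Stage j covers steps 2^(j-1) .. min(2^j - 1, T) and uses width J - j + 1,
    so widths shrink from J down to 1. J = T.bit_length() is the smallest
    integer with 2^J - 1 >= T (an assumption made here so every step is covered).
    """
    J = T.bit_length()
    schedule = []
    for j in range(1, J + 1):
        first, last = 2 ** (j - 1), min(2 ** j - 1, T)
        schedule.append((j, first, last, J - j + 1))
    return schedule

def cost_upper_bound(T):
    """Total cost if a width-w probe costs at most 1/w (assumed cost model)."""
    return sum((last - first + 1) / w for _, first, last, w in stage_schedule(T))

schedule_7 = stage_schedule(7)
staged_cost_100 = cost_upper_bound(100)
```

Because the last stage uses node probes and covers roughly half of the horizon, the saving relative to always probing single nodes comes from the earlier, wider stages.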

At each time step $t = 1, 2, \ldots, T$, we estimate the value of
$\alpha^*$ by using $\ell_2$-regularized least squares as follows.
Let $\{s_i := \pi(i), i = 1, 2, \ldots, t\}$ denote the probes
selected till time $t$ and $\{r_i, i = 1, 2, \ldots, t\}$ denote the
corresponding rewards. The estimate of $\alpha^*$, denoted
$\hat{\alpha}_t$, is computed as

$$\hat{\alpha}_t = \arg\min_{\alpha} \left( \sum_{i=1}^{t} [s_i'Q\alpha - r_i]^2 + \|\alpha\|_\Lambda^2 \right).$$

Algorithm 1 CheapUCB

Input:
    $G$: graph
    $T$: number of steps
    $\lambda, \delta$: regularization and confidence parameters
    $R, c$: upper bound on noise and norm of $\alpha$
Initialization:
    $d \leftarrow \arg\max\{d : (d - 1)\lambda_d \le T/\log(1 + T/\lambda)\}$
    $\beta \leftarrow 2R\sqrt{d\log(1 + T/\lambda) + 2\log(1/\delta)} + c$
    $V_0 \leftarrow \Lambda_{\mathcal{L}} + \lambda I$, $S_0 \leftarrow 0$, $r_0 \leftarrow 0$
for $j = 1 \to J$ do
    for $t = 2^{j-1} \to \min\{2^j - 1, T\}$ do
        $S_t \leftarrow S_{t-1} + r_{t-1}\tilde{s}_{t-1}$
        $V_t \leftarrow V_{t-1} + \tilde{s}_{t-1}\tilde{s}_{t-1}'$
        $\hat{\alpha}_t \leftarrow V_t^{-1}S_t$
        $\tilde{s}_t \leftarrow \arg\max_{s \in \mathcal{S}_{J-j+1}} \left(\tilde{s}'\hat{\alpha}_t + \beta\|\tilde{s}\|_{V_t^{-1}}\right)$
    end for
end for

Theorem 2 Set $J = \lceil \log T \rceil$ in the algorithm. Let $d$ be
the effective dimension and $\lambda$ be the smallest eigenvalue of
$\Lambda$. Let $\tilde{s}'\alpha^* \in [-1, 1]$ for all
$s \in \mathcal{S}$. Then the cumulative regret of the algorithm is,
with probability at least $1 - \delta$, bounded as:

(i) If (5) holds and $\lambda_{d+1}/\lambda_d \ge O(d^2)$, then

$$R_T \le \left(8R\sqrt{d\log(1 + T/\lambda) + 2\log(1/\delta)} + 4c\right)\sqrt{dT\log(1 + T/\lambda)} + c'd^2\log_2(T/2)\log(T/\lambda + 1).$$

(ii) If (5) and (10) hold, then

$$R_T \le \left(8R\sqrt{d\log(1 + T/\lambda) + 2\log(1/\delta)} + 4c\right)\sqrt{dT\log(1 + T/\lambda)} + c'd\sqrt{T/4}\,\log_2(T/2)\log(T/\lambda + 1).$$

Moreover, the cumulative cost of CheapUCB is bounded as

$$C_T \le \sum_{j=1}^{J-1} \frac{2^{j-1}}{J - j + 1} \le \frac{3T}{4} - \frac{1}{2}.$$

Remark 1 Observe that when the eigenvalue gap is large, we get the
regret to order $d\sqrt{T}$ within a constant factor, satisfying the
constraint (7). For the general case, compared to SpectralUCB, the
regret bound of our algorithm increases by the second term in (ii),
but it is still of the order $d\sqrt{T}$. However, the total cost in
CheapUCB is smaller than in SpectralUCB by an amount of at least
$T/4 + 1/2$, i.e., a cost reduction of the order of $T$ is achieved
by our algorithm.

Corollary 1 CheapUCB matches the regret performance of SpectralUCB
and provides a cost gain of $O(T)$.
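
To make the staged procedure concrete, here is a hedged NumPy sketch of a CheapUCB-style loop. It is an illustration under several assumptions, not the authors' code: the function name `cheap_ucb` and its default parameters are hypothetical, $J$ is chosen as the smallest integer with $2^J - 1 \ge T$, nodes with fewer than $w - 1$ neighbors get a correspondingly narrower probe, Gaussian noise with standard deviation $R$ stands in for $R$-sub-Gaussian noise, the matrix inverse is recomputed at every step for clarity, and the cost is the raw quadratic form $\tilde{s}'\Lambda_{\mathcal{L}}\tilde{s}$ without the normalization used in the paper.

```python
import numpy as np

def cheap_ucb(A, alpha_star, T, lam=0.01, delta=0.001, R=0.01, c=1.0, seed=0):
    """Staged UCB over neighborhood probes; returns (probes, rewards, total cost)."""
    rng = np.random.default_rng(seed)
    N = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A
    eig, Q = np.linalg.eigh(L)
    eig = np.clip(eig, 0.0, None)   # remove tiny negative round-off

    # Effective dimension: largest d with (d - 1) * lambda_d <= T / log(1 + T/lam).
    bound = T / np.log(1 + T / lam)
    d = max(k for k in range(1, N + 1) if (k - 1) * (eig[k - 1] + lam) <= bound)
    beta = 2 * R * np.sqrt(d * np.log(1 + T / lam) + 2 * np.log(1 / delta)) + c

    def probe(i, w):
        # Node i plus its (at most) w - 1 heaviest neighbors, uniform weights.
        order = np.argsort(-A[i], kind="stable")
        nbrs = [k for k in order if k != i and A[i, k] > 0]
        support = [i] + nbrs[: w - 1]
        s = np.zeros(N)
        s[support] = 1.0 / len(support)
        return s

    V = np.diag(eig + lam)          # V_0 = Lambda_L + lam * I
    S = np.zeros(N)
    J = int(T).bit_length()         # assumed: smallest J with 2^J - 1 >= T
    history, rewards, total_cost = [], [], 0.0
    for j in range(1, J + 1):
        width = J - j + 1
        # Rows are the GFT coefficients of the N probes of this width.
        arms = np.array([Q.T @ probe(i, width) for i in range(N)])
        for t in range(2 ** (j - 1), min(2 ** j - 1, T) + 1):
            V_inv = np.linalg.inv(V)
            alpha_hat = V_inv @ S
            conf = np.sqrt(np.einsum("ij,jk,ik->i", arms, V_inv, arms))
            s_t = arms[np.argmax(arms @ alpha_hat + beta * conf)]
            r_t = s_t @ alpha_star + R * rng.standard_normal()
            S += r_t * s_t
            V += np.outer(s_t, s_t)
            total_cost += float(np.sum(eig * s_t ** 2))
            history.append(s_t)
            rewards.append(r_t)
    return history, rewards, total_cost

# Usage on a toy 6-node cycle (assumed example) with a hypothetical alpha*.
N = 6
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
alpha_star = np.zeros(N)
alpha_star[1] = 1.0
probes, rewards, cost = cheap_ucb(A, alpha_star, T=15)
```

Early stages use wide, cheap probes to narrow down promising regions, and only the final stage pays for single-node probes.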



5.4. Computational complexity and scalability 

The computational and scalability issues of CheapUCB are essentially
those associated with SpectralUCB, i.e., obtaining the eigenbasis of
the graph Laplacian, matrix inversion and computation of the UCBs.
Though CheapUCB uses larger sets of arms or probes at each step, it
needs to compute only $N$ UCBs, as there are only $N$ probes of each
width $w$. The $i$-th probe of width $w$ can be computed by sorting
the edge weights of node $i$ and assigning weight $1/w$ to the first
$w$ components, which can be done in order $N \log N$ computations.
As in Valko et al. (2014), we speed up matrix inversion using
iterative update (Zhang, 2005), and compute the eigenbasis of the
symmetric Laplacian matrix using fast symmetric diagonally dominant
solvers such as CMG (Koutis et al., 2011).

Figure 1. Regret and Cost for Barabási-Albert (BA) and Erdős-Rényi
(ER) graphs with N = 250 nodes and T = 100.
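
One standard way to realize an iterative update of the inverse after a rank-one change $V + \tilde{s}\tilde{s}'$ is the Sherman-Morrison formula. The sketch below assumes this is the kind of update meant; whether it matches the cited method exactly is not confirmed by the text, and the function name is hypothetical.

```python
import numpy as np

def rank_one_inverse_update(V_inv, s):
    """Return (V + s s')^{-1} given V^{-1}, via the Sherman-Morrison formula.

    Assumes V (and hence V_inv) is symmetric positive definite, so the
    denominator 1 + s' V^{-1} s is strictly positive.
    """
    Vs = V_inv @ s
    return V_inv - np.outer(Vs, Vs) / (1.0 + s @ Vs)
```

Each update costs on the order of $N^2$ operations, compared with roughly $N^3$ for recomputing the inverse from scratch.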

6. Experiments 

We evaluate and compare our algorithm with SpectralUCB, which has
been shown to outperform its competitor LinUCB for learning on graphs
with a large number of nodes. To demonstrate the potential of our
algorithm in a more realistic scenario we also provide experiments on
the Forest Cover Type dataset. We set $\delta = 0.001$, $R = 0.01$,
and $\lambda = 0.01$.

6.1. Random graphs models 

We generated graphs from two graph models that are widely used to
analyze connectivity in social networks. First, we generated an
Erdős-Rényi (ER) graph with each edge sampled with probability 0.05
independently of the others. Second, we generated a Barabási-Albert
(BA) graph with degree parameter 3. The weights of the edges of these
graphs were assigned uniformly at random.

To obtain a reward function $f$, we randomly generate a sparse vector
$\alpha^*$ with a small number $k \ll N$ of nonzero entries and use
it to linearly combine the eigenvectors of the graph Laplacian as
$f = Q\alpha^*$, where $Q$ is the orthonormal matrix derived from the
eigendecomposition of the graph Laplacian. We ran our algorithm on
each graph in the regime $T < N$. In the plots displayed we used
$N = 250$, $T = 150$ and $k = 5$. We averaged the experiments over
100 runs.
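
The synthetic setup just described can be sketched as follows. This is an assumed reconstruction for illustration: the text does not say which $k$ coefficients of $\alpha^*$ are nonzero or how they are distributed, so the snippet places them on the $k$ smallest eigenvalues and draws them uniformly from $[-1, 1]$, and it draws edge weights uniformly from $[0, 1)$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, k = 250, 0.05, 5

# Erdos-Renyi graph: each edge present independently with probability p,
# with edge weights drawn uniformly at random from [0, 1).
upper = np.triu(rng.random((N, N)) < p, k=1)
weights = np.triu(rng.random((N, N)), k=1) * upper
W = weights + weights.T

L = np.diag(W.sum(axis=1)) - W
eigvals, Q = np.linalg.eigh(L)

# Sparse parameter: k nonzero coefficients. Placing them on the k smallest
# eigenvalues (a choice made here, not stated in the text) keeps f smooth.
alpha_star = np.zeros(N)
alpha_star[:k] = rng.uniform(-1.0, 1.0, size=k)
f = Q @ alpha_star
```

The smoothness of the resulting reward can be read off as $f'\mathcal{L}f = \sum_i \lambda_i (\alpha_i^*)^2$, which is small when the nonzero coefficients sit on small eigenvalues.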

From Figure 1, we see that the cumulative regret performance of
CheapUCB is slightly worse than that of SpectralUCB, but
significantly better than that of LinUCB. However, in terms of cost,
CheapUCB provides a gain of at least 30% compared to both SpectralUCB
and LinUCB.


6.2. Stochastic block models 

Community structure commonly arises in many networks. Many nodes can
be naturally grouped together into a tightly knit collection of
clusters with sparse connections among the different clusters. Graph
representations of such networks often exhibit dense clusters with
sparse connections between them. Stochastic block models (SBMs) are
popular in modeling such community structure in many real-world
networks (Girvan & Newman, 2002).

The adjacency matrix of SBMs exhibits a block-diagonal behavior. A
generative model for SBMs is based on connecting nodes within each
block/cluster with high probability and nodes that are in two
different blocks/clusters with low probability. For our simulations,
we generated an SBM as follows. We grouped $N = 250$ nodes into 4
blocks of size 100, 60, 40 and 50, and connected nodes within each
block with probability 0.7. The nodes from different blocks are
connected with probability 0.02. We generated the reward function as
in the previous subsection. The first 6 eigenvalues of the graph are
0, 3, 4, 5, 29, 29.6, ..., i.e., there is a large gap between the 4th
and 5th eigenvalues, which agrees with our intuition that there
should be 4 clusters (see Prop. 2). As seen from (a) and (b) in
Figure 2, in this regime CheapUCB gives the same performance as
SpectralUCB at a significantly lower cost, which confirms Theorem 2
(i) and Proposition 2.
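
The SBM described above can be sampled with a few lines of NumPy. The helper `sample_sbm` is a hypothetical name, and the random seed is arbitrary, so the eigenvalues obtained will differ somewhat from the ones quoted in the text; the sketch only illustrates where the eigenvalue gap comes from.

```python
import numpy as np

def sample_sbm(sizes, p_in, p_out, rng):
    """Sample a symmetric 0/1 adjacency matrix of a stochastic block model."""
    labels = np.repeat(np.arange(len(sizes)), sizes)
    same_block = labels[:, None] == labels[None, :]
    probs = np.where(same_block, p_in, p_out)
    # Sample the strict upper triangle once, then mirror it.
    upper = np.triu(rng.random(probs.shape) < probs, k=1)
    return (upper | upper.T).astype(float)

rng = np.random.default_rng(0)
A = sample_sbm([100, 60, 40, 50], p_in=0.7, p_out=0.02, rng=rng)
L = np.diag(A.sum(axis=1)) - A
eigvals = np.linalg.eigvalsh(L)
gaps = np.diff(eigvals[:6])     # consecutive gaps among the 6 smallest eigenvalues
```

With four dense blocks, the first four eigenvalues reflect the sparse cuts between blocks, while the fifth already belongs to the dense within-block spectrum, hence the jump between them.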

6.3. Forest Cover Type data 

As our motivation for cheap bandits comes from scenarios involving
sensing costs, we performed experiments on the Forest Cover Type
data, a collection of 581021 labeled samples, each providing
observations on a 30m × 30m region of a forest area. This dataset was
chosen to match the radar motivation from the introduction: we can
view it as sensing the forest area from above, where vague sensing is
cheap and specific sensing at low altitudes is costly. This dataset
was already used to evaluate a bandit setting by Filippi et al.
(2010).

The labels in the Forest Cover Type data indicate the dominant
species of trees (cover type) in a given region. The observations are
12 'cartographic' measures of the regions and are used as independent
variables to derive the cover types. Ten of the cartographic measures
are quantitative and indicate the distance of the regions with
respect to some reference points. The other two are qualitative
binary variables indicating the presence of certain characteristics.

Figure 2. (a) Regret and (b) Cost for Stochastic block model with
N = 250 nodes and 4 blocks, (c) Regret and (d) Cost on the
'Cottonwood' cover type of the forest data.

In a forest area, the cover type of a region depends on the
geographical conditions, which mostly remain similar in the
neighboring regions. Thus, the cover types change smoothly
over the neighboring regions and are likely to be
concentrated in some parts of the forest. Our goal is to find
the region where a particular cover type has the highest
concentration. For example, such a requirement arises in
aerial reconnaissance, where an airborne vehicle (such as a
UAV) collects ground information through a series of
measurements to identify the regions of interest. In such
applications, larger areas can be sensed at higher altitudes
more quickly (lower cost), but this sensing suffers from a
lower resolution. On the other hand, smaller areas can be
sensed at lower altitudes but at much higher costs.

To find the regions of high concentration of a given cover
type, we first clustered the samples using only the
quantitative attributes, ignoring all the qualitative
measurements, as done in (Filippi et al., 2010). We generated
2000 clusters (after normalizing the data to lie in the
interval [0, 1]) using k-means with Euclidean distance as the
distance metric. For each cover type, we defined the reward
on clusters as the fraction of samples in the cluster that
have the given cover type. We then generated graphs taking
cluster centers as nodes and, using the 10-nearest-neighbors
method, connected those that have similar rewards with edge
weight 1. Note that neighboring clusters are geographically
close and will have similar cover types, making their rewards
similar.
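
A minimal sketch of the reward and graph construction is given below. It assumes the cluster centers and cluster assignments are already available (for instance from any k-means routine), and it links centers by Euclidean distance, which is one possible reading of the nearest-neighbors step. The helper names and the small synthetic data are hypothetical stand-ins, not the forest data.

```python
import numpy as np

def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between the rows of a and the rows of b."""
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.maximum(sq, 0.0)  # clip tiny negatives caused by rounding

def cluster_rewards(assignments, labels, target, n_clusters):
    """Fraction of samples in each cluster whose label equals `target`."""
    counts = np.bincount(assignments, minlength=n_clusters)
    hits = np.bincount(assignments, weights=(labels == target).astype(float),
                       minlength=n_clusters)
    # Empty clusters get reward 0 instead of a division by zero.
    return np.divide(hits, counts, out=np.zeros(n_clusters), where=counts > 0)

def knn_graph(centers, k):
    """Symmetric 0/1 adjacency linking each center to its k nearest centers."""
    dist = pairwise_sq_dists(centers, centers)
    np.fill_diagonal(dist, np.inf)  # a node is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros((len(centers), len(centers)))
    adj[np.repeat(np.arange(len(centers)), k), nearest.ravel()] = 1.0
    return np.maximum(adj, adj.T)  # symmetrize: edge if either end selects the other

rng = np.random.default_rng(0)
samples = rng.random((500, 10))         # stand-in for normalized quantitative attributes
labels = rng.integers(1, 8, size=500)   # stand-in for the 7 cover types
centers = rng.random((20, 10))          # stand-in for k-means cluster centers
assignments = pairwise_sq_dists(samples, centers).argmin(axis=1)

rewards = cluster_rewards(assignments, labels, target=4, n_clusters=20)
adj = knn_graph(centers, k=10)
```

Symmetrizing with an elementwise maximum means a node can end up with more than k neighbors; a mutual-neighbor rule would be the stricter alternative.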

We first considered the ‘Cottonwood/Willow’ cover type, for
which nodes’ rewards vary from 0 to 0.068. We plot the
cumulative regret and cost in (c) and (d) in Figure 2 for
T = 100. As we can see, the cumulative regret of CheapUCB
saturates faster than that of LinUCB and its performance is
similar to that of SpectralUCB. Compared to both LinUCB and
SpectralUCB, the total cost of CheapUCB is lower by 35%. We
also considered reward functions for all the 7 cover types
and the cumulative regret is shown in Figure 3. Again, the
cumulative regret of CheapUCB is smaller than that of LinUCB
and close to that of SpectralUCB, with the same cost gain as
in Figure 2(d) for all the cover types.


Figure 3. Cumulative regret for different cover types of the forest
cover type data set with 2000 clusters (N=2000): 1- Spruce/Fir,
2- Lodgepole Pine, 3- Ponderosa Pine, 4- Cottonwood/Willow, 5- Aspen,
6- Douglas-fir, 7- Krummholz.

7. Conclusion 

We introduced cheap bandits, a new setting that aims to
minimize the sensing cost of the group actions while
attaining the state-of-the-art regret guarantees in terms of
effective dimension. The main advantage over typical bandit
settings is that it models situations where getting the
average reward from a set of neighboring actions is less
costly than getting a reward from a single one. For the
stochastic rewards, we proposed and evaluated CheapUCB, an
algorithm that guarantees a cost gain linear in time. In the
future, we plan to extend this new sensing setting to other
settings with limited feedback, such as contextual,
combinatorial and non-stochastic bandits. As a by-product of
our analysis, we establish an $\Omega(\sqrt{dT})$ lower bound
on the cumulative regret for a class of graphs with effective
dimension d.
8. Proof of Proposition 1

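For a given policy $\pi$, parameter $\alpha^*$, horizon $T$,
and a graph $G$ define the expected cumulative regret as

$$\mathrm{Regret}(T, \pi, \alpha^*, G) = \mathbb{E}\left[\sum_{t=1}^{T} \left(\tilde{s}_* \cdot \alpha^* - \tilde{s}_t \cdot \alpha^*\right)\right],$$

where $\tilde{s}_t = \pi'(t)Q$, and $Q$ is the orthonormal
basis matrix corresponding to the Laplacian of $G$. Let
$\mathcal{G}_d$ denote the family of graphs with effective
dimension $d$. Define the $T$-period risk of the policy $\pi$ as

$$\mathrm{Risk}(T, \pi) = \max_{G \in \mathcal{G}_d} \; \max_{\alpha^* \in \mathbb{R}^N,\ \|\alpha^*\|_\Lambda \le c} \left[\mathrm{Regret}(T, \pi, \alpha^*, G)\right].$$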


We first establish that there exists a graph with effective
dimension d, and a class of smooth reward functions defined
over it with parameters $\alpha^*$ in a d-dimensional vector
space.


Lemma 2 Given $T$, there exists a graph $G \in \mathcal{G}_d$ such that

$$\max_{\alpha^* \in \mathbb{R}^d,\ \|\alpha^*\|_\Lambda \le c} \left[\mathrm{Regret}(T, \pi, \alpha^*, G)\right] \le \mathrm{Risk}(T, \pi).$$


Proof: We prove the lemma by explicit construction of a
graph. Consider a graph $G$ consisting of $d$ disjoint
connected subgraphs denoted as $G_j$, $j = 1, 2, \ldots, d$.
Let the nodes in each subgraph have the same reward. The set
of eigenvalues of the graph is
$\{0, \lambda_1, \ldots, \lambda_{N-d}\}$, where eigenvalue 0
is repeated $d$ times. Note that the set of eigenvalues of
the graph is the union of the sets of eigenvalues of the
individual subgraphs. Without loss of generality, assume that
$\lambda_1 > T/(d \log(T/\lambda + 1))$ (this is always
possible, for example if the subgraphs are cliques). Then,
the effective dimension of the graph $G$ is $d$. Since the
graph separates into $d$ disjoint subgraphs, we can split the
reward function $f = Q\alpha$ into $d$ parts, one
corresponding to each subgraph. We write $f_j = Q_j \alpha_j$
for $j = 1, 2, \ldots, d$, where $f_j$ is the reward function
associated with $G_j$, $Q_j$ is the orthonormal matrix
corresponding to the Laplacian of $G_j$, and $\alpha_j$ is a
sub-vector of $\alpha$ corresponding to $G_j$.

Write $\alpha_j = Q_j' f_j$. Since $f_j$ is a constant vector
and, except for one, all the columns in $Q_j$ are orthogonal
to $f_j$, it is clear that $\alpha_j$ has only one non-zero
component. We conclude that for reward functions that are
constant on each subgraph, $\alpha$ has only $d$ non-zero
components and lies in a $d$-dimensional space. The proof is
complete by taking this graph as $G$.
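
The sparsity claim in this construction can be checked numerically. The sketch below assumes that every subgraph is a clique of the same size `m` (a convenience for the example, not a requirement of the argument) and uses arbitrary reward values. It verifies that a reward vector that is constant on each clique has spectral coefficients supported on the zero eigenvalues only.

```python
import numpy as np

def disjoint_cliques_laplacian(d, m):
    """Unnormalized Laplacian of d disjoint cliques with m nodes each."""
    clique_lap = m * np.eye(m) - np.ones((m, m))   # L = D - A for the clique K_m
    return np.kron(np.eye(d), clique_lap)          # block-diagonal over the cliques

d, m = 3, 5
lap = disjoint_cliques_laplacian(d, m)
eigvals, Q = np.linalg.eigh(lap)   # ascending eigenvalues, orthonormal columns

# Reward that is constant on each clique (the values are arbitrary).
f = np.repeat([0.2, 0.9, 0.5], m)
alpha = Q.T @ f                    # spectral coefficients, so that f = Q @ alpha

print(np.round(eigvals, 6))
print(np.round(alpha, 6))
```

Because the zero eigenvalue has multiplicity d, the eigenvectors returned for it form an arbitrary orthonormal basis of that eigenspace; the coefficients outside the first d positions vanish regardless of this choice.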

Note that a graph with effective dimension $d$ cannot have
more than $d$ disjoint connected subgraphs. Next, we restrict
our attention to graph $G$ and rewards that are piecewise
constant on each clique. That means that the nodes in each
clique have the same reward. Recall that the action set $S_d$
consists of actions that can probe a node or a group of
neighboring nodes. Therefore, any group action will only
allow us to observe the average reward from a group of nodes
within a clique but not across the cliques. Then, all node
and group actions used to observe reward from within a clique
are indistinguishable. Hence, $S_d$ collapses to a set of $d$
distinct actions, one associated with each clique, and the
problem reduces to that of selecting a clique with the
highest reward. We henceforth treat each clique as an arm
where all the nodes within the same clique share the same
reward value.
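
The reduction to d arms can be illustrated with a toy example. The sketch below assumes three cliques of five nodes each with arbitrary reward values, and models a probe simply as the average reward over the nodes it covers; `probe_mean` is a hypothetical helper used only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
clique_rewards = np.array([0.2, 0.9, 0.5])   # one reward value per clique
m = 5                                        # nodes per clique (assumed equal sizes)
f = np.repeat(clique_rewards, m)             # piecewise-constant node rewards

def probe_mean(f, nodes):
    """Average reward observed by a probe that covers the given nodes."""
    return f[np.asarray(nodes)].mean()

# Probes of every width inside the second clique (nodes 5..9) return the same value.
second_clique = np.arange(5, 10)
values = [probe_mean(f, rng.choice(second_clique, size=w, replace=False))
          for w in range(1, m + 1)]

# The problem therefore reduces to picking the best clique, i.e. the best arm.
best_clique = int(np.argmax(clique_rewards))
```

Since the width of a probe does not change what is observed here, only the identity of the clique matters, which is exactly the multi-armed structure used in the lower bound.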


We now provide a lower bound on the expected regret defined
as follows

$$\widetilde{\mathrm{Regret}}(T, \pi, G) = \mathbb{E}\left[\mathrm{Regret}(T, \pi, \alpha^*, G)\right], \qquad (11)$$

where the expectation is over the reward function on the arms.

To lower bound the regret we follow the argument of Auer
et al. (2002) and their Theorem 5.1, where an adversarial
setting is considered and the expectation in (11) is over the
reward functions generated randomly according to Bernoulli
distributions. We generalize this construction to our case
with Gaussian noise. The reward generation process is as
follows:

Without loss of generality choose cluster 1 to be the good
cluster. At each time step $t$, sample the reward of cluster 1
from the Gaussian distribution with mean $1/2 + \epsilon$ and
unit variance. For all other clusters, sample the reward from
the Gaussian distribution with mean $1/2$ and unit variance.

The rest of the argument follows exactly as in the proof of
Theorem 5.1 (Auer et al., 2002) except at their Equation 29.
To obtain an equivalent version for Gaussian rewards, we use
the relationship between the $L_1$ distance of Gaussian
distributions and their KL divergence. We then apply the
formula for the KL divergence between Gaussian random
variables to obtain an equivalent version of their Equation
30. Now note that the $\log(1 - \cdot)$ term appearing there
is of order $\epsilon^2$ (within a constant). Then the proof
follows similarly by setting $\epsilon = \sqrt{d/T}$ and
noting that the $L_2$ norm of the mean rewards is bounded by
$c$ for an appropriate choice of $\lambda$.


9. Proof of Proposition 2

In the following, we first give some definitions and related
results.

Definition 2 (k-way expansion constant (Lee et al., 2012))
Consider a graph $G$ and $X \subset V$. Let

$$\phi(X) := \frac{|\partial X|}{V(X)},$$

where $V(X)$ denotes the sum of the degrees of the nodes in
$X$ and $|\partial X|$ denotes the number of edges between the
nodes in $X$ and $V \setminus X$.

For all $k > 0$, the $k$-way expansion constant is defined as

$$\rho_G(k) = \min\left\{\max_{1 \le i \le k} \phi(V_i) : V_i \cap V_j = \emptyset \text{ for } i \ne j,\ |V_i| \ne 0\right\}.$$

Let $\mu_1 \le \mu_2 \le \cdots \le \mu_N$ denote the
eigenvalues of the normalized Laplacian of $G$.

Theorem 3 ((Gharan & Trevisan, 2014)) Let $\epsilon > 0$ and
suppose $\rho(k+1) > (1 + \epsilon)\rho(k)$ holds for some
$k > 0$. Then the following holds:

$$\mu_k/2 \le \rho(k) \le O(k^2)\sqrt{\mu_k}. \qquad (12)$$

There exists a $k$-partition $\{V_i : i = 1, 2, \ldots, k\}$
of $V$ such that for all $i = 1, 2, \ldots, k$

$$\phi(V_i) \le k\rho(k) \quad \text{and} \qquad (13)$$

$$\phi(G[V_i]) \ge \epsilon\rho(k+1)/(14k), \qquad (14)$$

where $\phi(G[X])$ denotes the Cheeger constant (conductance)
of the subgraph induced by $X$.

Definition 3 (Isoperimetric number)

$$\theta(G) = \min\left\{\frac{|\partial X|}{|X|} : |X| \le N/2\right\}.$$

Let $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_N$ denote
the eigenvalues of the unnormalized Laplacian of $G$. The
following is a standard result.

$$\lambda_2/2 \le \theta(G) \le \sqrt{2\kappa\lambda_2}. \qquad (15)$$

Proof: The relation $\lambda_{k+1}/\lambda_k \ge O(k^2)$
implies that $\mu_{k+1}/\mu_k \ge O(k^2)$. Using the upper and
lower bounds on the eigenvalues in (12), the relation
$\rho(k+1) \ge (1 + \epsilon)\rho(k)$ holds for some
$\epsilon > 1/2$. Then, applying Theorem 3 we get
$k$-partitions satisfying (13)-(14). Let $L_j$ denote the
Laplacian induced by the subgraph
$G[V_j] = (V_j, \mathcal{E}_j)$ for $j = 1, 2, \ldots, k$. By
the quadratic property of the graph Laplacian we have

$$f'Lf = \sum_{(u,v) \in \mathcal{E}} (f_u - f_v)^2 \qquad (16)$$

$$\ge \sum_{j=1}^{k} \sum_{(u,v) \in \mathcal{E}_j} (f_u - f_v)^2 \qquad (17)$$

$$= \sum_{j=1}^{k} f_j' L_j f_j, \qquad (18)$$
where $f_j$ denotes the reward vector on the induced subgraph
$G_j := G[V_j]$. In the following we just focus on the
optimal node. The same arguments hold for any other node.
Without loss of generality assume that the node with optimal
reward lies in subgraph $G_l$ for some $1 \le l \le d$. From
the last relation we have $f_l' L_l f_l \le c$. The reward
functions on the subgraph $G_l$ can be represented as
$f_l = Q_l \alpha_l$ for some $\alpha_l$, where $Q_l$
satisfies $L_l = Q_l \Lambda_l Q_l'$ and $\Lambda_l$ denotes
the diagonal matrix with the eigenvalues of $L_l$. We have


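$$
\begin{aligned}
|F_G(s_*) - F_G(s_*^w)| &= |F_{G_l}(s_*) - F_{G_l}(s_*^w)| \\
&\le \frac{\sqrt{2}\,c}{\theta(G_l)} && \text{from Cauchy-Schwarz and (15)} \\
&\le \frac{\sqrt{2}\,c}{\phi(G_l)} && \text{using } \theta(G_l) \ge \phi(G_l) \\
&\le \frac{14k\sqrt{2}\,c}{\epsilon\,\rho(k+1)} && \text{from Theorem 3, Eq. (14)} \\
&\le \frac{56k\sqrt{2}\,c}{\mu_{k+1}} && \text{from Theorem 3, Eq. (12)} \\
&\le \frac{56k\kappa\sqrt{2}\,c}{\lambda_{k+1}} && \text{using } \mu_{k+1} \ge \lambda_{k+1}/\kappa.
\end{aligned}
$$
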


This completes the proof. 
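
Two ingredients of this proof are easy to check on a small graph: the quadratic form of the Laplacian equals the sum of squared differences over edges (and can only decrease when the edges between parts of a partition are dropped), and half of the second Laplacian eigenvalue lower-bounds the isoperimetric number. The sketch below uses a hypothetical 6-node graph, two triangles joined by one edge, and a brute-force search that is only feasible for tiny graphs.

```python
import itertools
import numpy as np

def laplacian(adj):
    """Unnormalized graph Laplacian L = D - A."""
    return np.diag(adj.sum(axis=1)) - adj

def isoperimetric_number(adj):
    """Brute-force min |boundary(X)| / |X| over nonempty X with |X| <= N/2."""
    n = len(adj)
    best = np.inf
    for size in range(1, n // 2 + 1):
        for subset in itertools.combinations(range(n), size):
            inside = np.zeros(n, dtype=bool)
            inside[list(subset)] = True
            boundary = adj[inside][:, ~inside].sum()  # edges leaving the subset
            best = min(best, boundary / size)
    return best

# Two triangles {0,1,2} and {3,4,5} joined by the single edge (2, 3).
adj = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[u, v] = adj[v, u] = 1.0

lap = laplacian(adj)
f = np.array([0.1, 0.2, 0.15, 0.8, 0.9, 0.85])   # arbitrary smooth-ish reward

quad = f @ lap @ f
edge_sum = sum((f[u] - f[v]) ** 2
               for u in range(6) for v in range(u + 1, 6) if adj[u, v])

# Keep only the edges inside each part of the partition.
parts = [[0, 1, 2], [3, 4, 5]]
within = sum(f[p] @ laplacian(adj[np.ix_(p, p)]) @ f[p] for p in parts)

lam2 = np.linalg.eigvalsh(lap)[1]
theta = isoperimetric_number(adj)
```

In this example the minimizing set is one of the triangles, which has a single boundary edge and three nodes.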


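10. Analysis of CheapUCB

For a given confidence parameter $\delta$ define

$$\beta = 2R\sqrt{d\log\left(1 + \frac{T}{\lambda}\right) + 2\log\frac{1}{\delta}} + c,$$

and consider the ellipsoid around the estimate $\hat{\alpha}_t$

$$C_t = \{\alpha : \|\hat{\alpha}_t - \alpha\|_{V_t} \le \beta\}.$$

We first state the following results from (Abbasi-Yadkori
et al., 2011), (Dani et al., 2008), and (Valko et al., 2014).

Lemma 3 (Self-Normalized Bound) Let
$\xi_t = \sum_{i=1}^{t} \tilde{s}_i \varepsilon_i$ and
$\lambda > 0$. Then, for any $\delta > 0$, with probability at
least $1 - \delta$ and for all $t > 0$,

$$\|\xi_t\|_{V_t^{-1}} \le \beta.$$

Lemma 4 Let $V_0 = \Lambda$. We have:

$$\log\frac{\det(V_t)}{\det(\Lambda)} \le \sum_{i=1}^{t} \|\tilde{s}_i\|^2_{V_i^{-1}} \le 2\log\frac{\det(V_{t+1})}{\det(\Lambda)}.$$

Lemma 5 Let $\|\alpha^*\|_2 \le c$. Then, with probability at
least $1 - \delta$, for all $t \ge 0$ and for any
$x \in \mathbb{R}^N$ we have $\alpha^* \in C_t$ and

$$|x \cdot (\hat{\alpha}_t - \alpha^*)| \le \|x\|_{V_t^{-1}}\,\beta.$$

Lemma 6 Let $d$ be the effective dimension and $T$ be the
time horizon of the algorithm. Then,

$$\log\frac{\det(V_{T+1})}{\det(\Lambda)} \le 2d\log\left(1 + \frac{T}{\lambda}\right).$$

10.1. Proof of Theorem 2

We first prove the case where the degree of each node is at
least $\log T$.

Consider step $t \in [2^{j-1}, 2^j - 1]$ in stage
$j = 1, 2, \ldots, J - 1$. Recall that in this step a probe of
width $J - j + 1$ is selected. Write $w_j := J - j + 1$, and
denote the probe of width $J - j + 1$ associated with the
optimal probe $s_*$ as simply $s_*^{w_j}$ and the
corresponding GFT as $\tilde{s}_*^{w_j}$. The probe selected
at time $t$ is denoted as $s_t$. Note that both $s_t$ and
$s_*^{w_j}$ lie in the set $S_{w_j}$. For notational
convenience let us denote

$$h(j) := \begin{cases} c'\sqrt{T}\,(J - j + 1)/\lambda_{d+1} & \text{when (10) holds} \\ c'd/\lambda_{d+1} & \text{when (9) holds.} \end{cases}$$

The instantaneous regret in step $t$ is

$$
\begin{aligned}
r_t &= \tilde{s}_* \cdot \alpha^* - \tilde{s}_t \cdot \alpha^* \\
&\le \tilde{s}_*^{w_j} \cdot \alpha^* + h(j) - \tilde{s}_t \cdot \alpha^* \\
&= \tilde{s}_*^{w_j} \cdot (\alpha^* - \hat{\alpha}_t) + \tilde{s}_*^{w_j} \cdot \hat{\alpha}_t + \beta\|\tilde{s}_*^{w_j}\|_{V_t^{-1}} - \beta\|\tilde{s}_*^{w_j}\|_{V_t^{-1}} - \tilde{s}_t \cdot \alpha^* + h(j) \\
&\le \tilde{s}_*^{w_j} \cdot (\alpha^* - \hat{\alpha}_t) + \tilde{s}_t \cdot \hat{\alpha}_t + \beta\|\tilde{s}_t\|_{V_t^{-1}} - \beta\|\tilde{s}_*^{w_j}\|_{V_t^{-1}} - \tilde{s}_t \cdot \alpha^* + h(j) \\
&= \tilde{s}_*^{w_j} \cdot (\alpha^* - \hat{\alpha}_t) + \tilde{s}_t \cdot (\hat{\alpha}_t - \alpha^*) + \beta\|\tilde{s}_t\|_{V_t^{-1}} - \beta\|\tilde{s}_*^{w_j}\|_{V_t^{-1}} + h(j) \\
&\le \beta\|\tilde{s}_*^{w_j}\|_{V_t^{-1}} + \beta\|\tilde{s}_t\|_{V_t^{-1}} + \beta\|\tilde{s}_t\|_{V_t^{-1}} - \beta\|\tilde{s}_*^{w_j}\|_{V_t^{-1}} + h(j) \\
&= 2\beta\|\tilde{s}_t\|_{V_t^{-1}} + h(j).
\end{aligned}
$$

We used (9)/(10) in the first inequality. The second
inequality follows from the algorithm design and the third
inequality follows from Lemma 5. Now, the cumulative regret of
the algorithm is given by

$$
\begin{aligned}
R_T &\le \sum_{j=1}^{J} \sum_{t=2^{j-1}}^{2^j - 1} \min\{2,\ 2\beta\|\tilde{s}_t\|_{V_t^{-1}} + h(j)\} \\
&\le \sum_{j=1}^{J} \sum_{t=2^{j-1}}^{2^j - 1} \min\{2,\ 2\beta\|\tilde{s}_t\|_{V_t^{-1}}\} + \sum_{j=1}^{J-1} \sum_{t=2^{j-1}}^{2^j - 1} h(j) \\
&\le \sum_{t=1}^{T} \min\{2,\ 2\beta\|\tilde{s}_t\|_{V_t^{-1}}\} + \sum_{j=1}^{J-1} h(j)\,2^{j-1}.
\end{aligned}
$$

Note that the summation in the second term includes only the
first $J - 1$ stages. In the last stage $J$, we use probes of
width 1 and hence we do not need to use (9)/(10) in bounding
the instantaneous regret. Next, we bound each term in the
regret separately.

To bound the first term we use the same steps as in the proof
of Theorem 1 (Valko et al., 2014). We repeat the steps below.

$$
\begin{aligned}
\sum_{t=1}^{T} \min\{2,\ 2\beta\|\tilde{s}_t\|_{V_t^{-1}}\}
&\le (2 + 2\beta) \sum_{t=1}^{T} \min\{1,\ \|\tilde{s}_t\|_{V_t^{-1}}\} \\
&\le (2 + 2\beta) \sqrt{T \sum_{t=1}^{T} \min\{1,\ \|\tilde{s}_t\|_{V_t^{-1}}\}^2} \\
&\le 2(1 + \beta)\sqrt{2T\log\left(|V_{T+1}|/|\Lambda|\right)} && (19) \\
&\le 4(1 + \beta)\sqrt{Td\log(1 + T/\lambda)} && (20) \\
&\le \left(8R\sqrt{2\log\frac{1}{\delta} + d\log\left(1 + \frac{T}{\lambda}\right)} + 4c + 4\right) \sqrt{Td\log\left(1 + \frac{T}{\lambda}\right)}.
\end{aligned}
$$

We used Lemma 4 and 6 in inequalities (19) and (20)
respectively. The final bound follows from plugging in the
value of $\beta$.

10.2. For the case when (10) holds:

For this case we use
$h(j) = c'\sqrt{T}\,(J - j + 1)/\lambda_{d+1}$. First observe
that $2^{j-1}h(j)$ is increasing in $1 \le j \le J - 1$. We
have

$$
\begin{aligned}
\sum_{j=1}^{J-1} 2^{j-1} h(j) &= \sum_{j=1}^{J-1} \frac{2^{j-1} c'\sqrt{T}\,(J - j + 1)}{\lambda_{d+1}} \le \frac{(J - 1)\,2^{J-1} c'\sqrt{T}}{\lambda_{d+1}} \\
&\le \log_2(T/2)\,\frac{2^{\log_2 T - 1} c'\sqrt{T}}{\lambda_{d+1}} = \log_2(T/2)\,\frac{c'\sqrt{T}\,(T/2)}{\lambda_{d+1}} \\
&\le d\,c'\sqrt{T/4}\,\log_2(T/2)\log(T/\lambda + 1).
\end{aligned}
$$

In the last step we applied the definition of effective
dimension, which gives
$\lambda_{d+1} \ge T/(d\log(T/\lambda + 1))$.

10.3. For the case when $\lambda_{d+1}/\lambda_d \ge O(d^2)$

For the case $\lambda_{d+1}/\lambda_d \ge O(d^2)$ we use
$h(j) = c'd/\lambda_{d+1}$.

$$\sum_{j=1}^{J-1} 2^{j-1} h(j) = \sum_{j=1}^{J-1} \frac{2^{j-1} c'd}{\lambda_{d+1}} \le \frac{2^{J-1} c'd}{\lambda_{d+1}} \le c'd^2 \log_2(T/2)\log(T/\lambda + 1).$$

Now consider the case where the minimum degree of the nodes
is $1 < a \le \log T$. In this case we modify the algorithm to
use only signals of width $a$ in the first $\log T - a + 1$
stages, and subsequently the signal width is reduced by one in
each of the following stages. The previous analysis holds for
this case and we get the same bounds on the cumulative regret
and cost. When $a = 1$, CheapUCB is the same as SpectralUCB,
hence the total cost and regret are the same as those of
SpectralUCB.

To bound the total cost, note that in stage $j$ we use
signals of width $J - j + 1$. Also, the cost of a signal given
in (2) can be upper bounded as $C(s^w) \le 1/w$. Then, we can
upper bound the total cost of signals used till step $T$ as

$$\sum_{j=1}^{J} \frac{2^{j-1}}{J - j + 1} \le \frac{1}{2}\sum_{j=1}^{J-1} 2^{j-1} + \frac{T}{2} = \frac{1}{2}\left(\frac{T}{2} - 1\right) + \frac{T}{2} = \frac{3T}{4} - \frac{1}{2}.$$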
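
The total-cost bound of 3T/4 - 1/2 for the doubling schedule can be checked numerically. The sketch below rests on two assumptions made for illustration: the horizon is a power of two, T = 2^J, and each probe of width w costs at most 1/w. Stage j then lasts 2^(j-1) steps and uses width J - j + 1. The function names are hypothetical.

```python
import math

def stage_schedule(T):
    """Yield (stage j, number of steps, probe width) for a horizon T = 2**J."""
    J = int(math.log2(T))
    for j in range(1, J + 1):
        yield j, 2 ** (j - 1), J - j + 1

def cost_upper_bound(T):
    """Sum over all stages of (steps in the stage) * (cost bound 1 / width)."""
    return sum(steps / width for _, steps, width in stage_schedule(T))

for T in (2, 16, 256, 4096):
    total = cost_upper_bound(T)
    # Stages before the last use width >= 2, so each of their steps costs at most 1/2.
    assert total <= 0.75 * T - 0.5 + 1e-12
    print(T, round(total, 3), 0.75 * T - 0.5)
```

Probing single nodes at every step would cost about T under the same accounting, so the schedule saves roughly a quarter of that, which is linear in T.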
















